This report is the first to document and study the feasibility of automatically evaluating the quality of experimental literature investigating bio–nano interactions. The first step of this automatic evaluation is to isolate the Materials and Methods section. The goal is to later use this section alone to assess whether the nanomaterials were characterised and to evaluate the quality of the articles.
This report contains preliminary analyses and an exploration of the data contained in the corpus of texts. The first goal of these analyses is to gain some understanding of the structure of the texts in the corpus and of the relations of the lemmas “material(s)” and “method(s)” to this corpus.
The second goal is to investigate how to detect the beginning of the “Materials and Methods” section. The main difficulty in identifying the start of this section is that these two words can also appear elsewhere in the text of the article (typically “cf. Materials and Methods”).
The corpus was created from the folder “Full Text dev set”, which contains 751 articles converted to txt format. The other articles are kept unseen in order to test, in “real-life conditions”, the efficacy of any tools developed later.
A few definitions to frame the problem:
Token: Word form or punctuation symbol. “,” and “(” are tokens, but so are “and” and “method”.
Lemma: Lemma or stem of a word form. For example, the tokens “Materials” and “materials” have the same lemma, “materials”.
Head: Head of the current word, which is either a value of token_id or zero (zero marks the root of the sentence).
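To make these three definitions concrete, here is a minimal sketch on a made-up three-token sentence. The data frame `toy` and its values are hypothetical; only the column names (`token_id`, `token`, `lemma`, `head_token_id`) follow the annotation used in this report.

```r
# Hypothetical toy annotation mimicking the udpipe columns used in this report
toy <- data.frame(
  doc_id        = "doc1",
  sentence_id   = 1,
  token_id      = c("1", "2", "3"),
  token         = c("Materials", "and", "Methods"),
  lemma         = c("materials", "and", "method"),
  head_token_id = c("0", "3", "1"),
  stringsAsFactors = FALSE
)

# The root of the sentence is the token whose head_token_id is "0"
root_token <- toy$token[toy$head_token_id == "0"]

# For the other tokens, the head is found by matching head_token_id to token_id
# (the root has no match, so its head is NA)
head_token <- toy$token[match(toy$head_token_id, toy$token_id)]

data.frame(token = toy$token, lemma = toy$lemma, head = head_token)
```

In this toy sentence “Materials” is the root, “and” points to “Methods”, and “Methods” points back to “Materials”.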
A quick exploratory analysis of the article by Abrams, MT et al., 2010, suggested that the “materials” token of the Materials and Methods section title has a specific property: its head_token_id is equal to zero, i.e. the token is the root of its sentence and is treated here as its own head (see the example below). This suggests that section titles of articles may share this property. This hypothesis is tested in the first part of this report and, for the lemmas “materials” and “material”, in a later section (Co-occurrences for materials and material when their head_token_id = 0).
In the later sections, we try different criteria to isolate some occurrences of the lemmas “materials”, “material”, “methods” and “method”. We use co-occurrence analysis to explore the surroundings of these lemmas in the text and to evaluate whether each criterion discriminates the beginning of the Materials and Methods section from the rest of the article.
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. It is a good way to create informal reports describing data analysis projects as a web page, and a good way to mix code and description in a readable manner. There are even books in this format, ranging from Data Analysis for the Life Sciences to Text Mining with R: A Tidy Approach, so anybody can understand and take over this work. Because this report is also code, it can be recompiled with new data (including another model for the annotation of the corpus).
library(udpipe)
library(lattice)
library(wordcloud)
library(igraph)
library(ggraph)
library(ggplot2)
library(dplyr)
The following lines load the corpus of texts, already tokenized and annotated:
x <- readRDS(file = "annotation.rds")
x <- as.data.frame(x)
length(unique(x$doc_id))
## [1] 751
Here is an example of a token “materials” with head_token_id = 0:
x[7467,]
## doc_id paragraph_id sentence_id sentence
## 7467 doc1 599 830 Materials and Methods Animals.
## token_id token lemma upos xpos feats head_token_id
## 7467 1 Materials materials NOUN NNS Number=Plur 0
## dep_rel deps misc
## 7467 root <NA> <NA>
Given the observation that, in “Materials and Methods”, the head_token_id of the token “Materials” was 0, one idea was to explore which lemmas in the corpus most commonly have a head_token_id equal to zero.
The expected outcome of this analysis is to retrieve the usual section titles of scientific articles, such as Abstract or Results, among the most common words. The goal is to assess whether this is a consistent property of section titles in the articles, and to uncover potential synonyms of “materials and methods” such as “experimental section”.
stats <- subset(x, head_token_id == 0) #https://bnosac.github.io/udpipe/docs/doc7.html
stats <- txt_freq(x = stats$lemma)
stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring words with Head_token_id = 0", xlab = "Freq")
Nonetheless, this assumption appears to be quite naive, as a lot of tokens have this property. Let’s filter for specific lemmas that correspond to usual section titles, like abstract or results:
stats<-stats %>% filter(key %in% c("material", "materials", "result", "results", "abstract", "introduction" , "method", "methods", "discussion", "references"))
stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 30), col = "cadetblue", main = "Count of lemma for usual sections name with Head_token_id = 0", xlab = "Freq")
stats
## key freq freq_pct
## 1 result 1829 0.3668933422
## 2 method 534 0.1071192153
## 3 materials 449 0.0900684038
## 4 discussion 376 0.0754247658
## 5 introduction 268 0.0537602054
## 6 material 132 0.0264789071
## 7 methods 82 0.0164490181
## 8 abstract 19 0.0038113578
## 9 results 8 0.0016047823
## 10 references 2 0.0004011956
Some section titles seem to have the aforementioned property. Nonetheless, the counts do not match the total number of articles in this corpus (751). Taking the lemma “discussion” as an example (376 occurrences with a head_token_id of zero): either some articles do not have a discussion section or, more probably, the token “discussion” does not always have the property mentioned earlier. We can check this:
occurrences<-which(x$lemma=="discussion")
length(occurrences)
## [1] 891
length(unique(x[occurrences,]$doc_id))
## [1] 703
There are 891 occurrences of the lemma “discussion” in the whole corpus, spread over 703 articles. Since only 376 of these occurrences have a head_token_id of zero, it seems very likely that a head_token_id of zero is not sufficient on its own to identify the tokens that are section titles.
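The comparison made above (occurrences of a lemma, documents containing it, and documents where it is a sentence root) can be sketched as a small helper. This is only an illustration on made-up data: the function `doc_coverage` and the data frame `toy` are hypothetical, and `head_token_id` is assumed to be stored as character, as in the annotation shown earlier.

```r
# Hypothetical toy annotation: "discussion" appears in three documents,
# but is a sentence root (head_token_id == "0") in only two of them
toy <- data.frame(
  doc_id        = c("doc1", "doc1", "doc2", "doc3", "doc3"),
  lemma         = c("discussion", "result", "discussion", "discussion", "discussion"),
  head_token_id = c("0", "0", "4", "0", "2"),
  stringsAsFactors = FALSE
)

doc_coverage <- function(annotation, target) {
  hits  <- annotation[annotation$lemma == target, ]   # every occurrence of the lemma
  roots <- hits[hits$head_token_id == "0", ]          # occurrences that are sentence roots
  c(
    occurrences    = nrow(hits),
    docs_with_word = length(unique(hits$doc_id)),
    docs_with_root = length(unique(roots$doc_id))
  )
}

coverage <- doc_coverage(toy, "discussion")
coverage
```

Comparing `docs_with_root` with `docs_with_word` gives a rough idea of how many documents the root criterion would miss for a given lemma.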
To explore the relationships of the lemmas “material(s)” and “method(s)” with the rest of the corpus, we can analyse which head tokens are the most recurrent for each of these lemmas. The goals of the analysis are:
grep_lemma_head_token_id <- function(index){
  # return the lemma of the head of the token stored at row "index" of x,
  # where x is the data frame of annotations generated by udpipe
  # x[index,] holds the token and all the associated data: lemma, but also sentence and doc_id
  occurrence <- x[index,]
  head_token_id <- as.numeric(occurrence$head_token_id)
  sentence_id <- occurrence$sentence_id
  doc_id <- occurrence$doc_id
  # a head_token_id of zero marks the root: the token is treated as its own head
  if (head_token_id == 0) {return(occurrence$lemma)}
  # otherwise, look up the lemma at position head_token_id within the same sentence
  # (this assumes that token_id matches the row position inside the sentence)
  lemma_head_token_id <- x[which(x$sentence_id==sentence_id & x$doc_id==doc_id)[head_token_id],]$lemma
  return(lemma_head_token_id)
}
material_occurrences<-which(x$lemma=="material")
head_token_lemmas<-sapply(material_occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable)
stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma material", xlab = "Freq")
occurrences<-which(x$lemma=="materials")
head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable)
stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma materials", xlab = "Freq")
occurrences<-which(x$lemma=="method")
head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable)
stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma method", xlab = "Freq")
occurrences<-which(x$lemma=="methods")
head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable)
stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma methods", xlab = "Freq")
head(stats, 10)
## head_token_lemmas Freq key
## 88 materials 108 materials
## 92 methods 83 methods
## 44 describe 39 describe
## 79 j 29 j
## 95 Mol 20 Mol
## 61 Enzymol 14 Enzymol
## 87 material 11 material
## 126 section 11 section
## 139 synthesis 10 synthesis
## 96 Mol. 9 Mol.
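The head-lemma lookup used above works one token at a time. As a possible alternative, the sketch below does the same lookup for a whole annotation at once by matching keys. The function `head_lemma` and the data frame `toy` are hypothetical; the sketch assumes that `token_id` is unique within a sentence and keeps the convention of this report that a root token is its own head.

```r
head_lemma <- function(annotation) {
  # Key identifying each token, and key of the token that it points to
  own_key  <- paste(annotation$doc_id, annotation$sentence_id, annotation$token_id)
  head_key <- paste(annotation$doc_id, annotation$sentence_id, annotation$head_token_id)
  lemma_of_head <- annotation$lemma[match(head_key, own_key)]
  # Same convention as in this report: a root token is treated as its own head
  is_root <- annotation$head_token_id == "0"
  lemma_of_head[is_root] <- annotation$lemma[is_root]
  lemma_of_head
}

# Hypothetical toy annotation with two one-sentence documents
toy <- data.frame(
  doc_id        = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  sentence_id   = c(1, 1, 1, 1, 1),
  token_id      = c("1", "2", "3", "1", "2"),
  lemma         = c("materials", "and", "method", "see", "method"),
  head_token_id = c("0", "3", "1", "0", "1"),
  stringsAsFactors = FALSE
)

heads <- head_lemma(toy)
# Frequency of the head lemmas of the lemma "method"
table(heads[toy$lemma == "method"])
```

Matching on `token_id` rather than on the row position inside the sentence is a design choice of this sketch; it is not how the function of this report proceeds.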
In the next sections we test different criteria to discriminate between occurrences of the lemmas “materials” and “material” inside the articles. The idea is to find a criterion that identifies the beginning of the “materials and methods” section.
Co-occurrence analysis shows how words are used either in the same sentence or next to each other. We use this approach to get a sense of the neighbourhood of the lemmas isolated by each criterion.
There are several types of co-occurrence analysis:

* Looking at which words are located in the same document/sentence/paragraph.
* Looking at which words are followed by another word.
* Looking at which words are in the neighbourhood of the word, i.e. follow the word within skipgram number of words.

See the documentation of the udpipe package for these three possible uses. We use the second approach, as it is the most relevant to our goal and the simplest to interpret. Different skipgram values can be used to get an idea of the more distant or more proximal neighbourhood.
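To illustrate the second approach, here is a minimal base R sketch that counts how often a term is immediately followed by another one, which is what a skipgram of 0 is meant to capture. The function `count_followers` and the vector `lemmas` are hypothetical; this is not the implementation used by udpipe, only a way to see what the resulting table contains.

```r
count_followers <- function(terms) {
  # Pair each term with the term that comes immediately after it
  pairs <- data.frame(
    term1 = head(terms, -1),
    term2 = tail(terms, -1),
    stringsAsFactors = FALSE
  )
  # Count how many times each (term1, term2) pair occurs
  counts <- aggregate(cooc ~ term1 + term2, data = cbind(pairs, cooc = 1L), FUN = sum)
  counts[order(-counts$cooc, counts$term1, counts$term2), ]
}

# Hypothetical lemma vector: two short sentences placed one after the other
lemmas <- c("materials", "and", "method", ".",
            "materials", "and", "method", "be", "describe")
cooc <- count_followers(lemmas)
cooc
```

In this toy vector the pair (".", "materials") spans a sentence boundary, because the lemmas of consecutive sentences are simply placed one after the other. This is presumably also why pairs starting with "." rank high in the tables of this report.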
The two functions below are meant to save space in the document. The first one keeps only the co-occurrences that involve the lemma of interest and plots them as a word network, a common technique to visualise word co-occurrences. The second one returns the same filtered co-occurrences as a table.
plot_cooccurrence <- function(stats, lemma, title){
  # helper to save space and keep this R Markdown document readable:
  # keep the co-occurrences involving the lemma(s) of interest and plot the top 30 as a network
  stats <- stats %>% filter(term1 %in% c(lemma) | term2 %in% c(lemma))
  wordnetwork <- head(stats, 30)
  wordnetwork <- graph_from_data_frame(wordnetwork)
  ggraph(wordnetwork, layout = "fr") +
    geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "pink") +
    geom_node_text(aes(label = name), col = "blue", size = 5) +
    theme_graph(base_family = "Helvetica") +
    theme(legend.position = "none") +
    labs(title = title)
}
head_cooc <- function(stats, lemma){
  # helper to save space and keep this R Markdown document readable:
  # return the top 30 co-occurrences involving the lemma(s) of interest
  stats <- stats %>% filter(term1 %in% c(lemma) | term2 %in% c(lemma))
  head(stats, 30)
}
stats <- cooccurrence(x = x$lemma, skipgram = 0)
Bigger skipgram values were not really relevant. Here we can simply look at the rows of the data frame stats to see how many times each word follows another.
plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for the lemma materials")
head_cooc(stats, lemma="materials")
## term1 term2 cooc
## 1 materials and 591
## 2 . materials 513
## 3 materials , 91
## 4 of materials 80
## 5 materials . 78
## 6 materials be 71
## 7 materials & 65
## 8 materials research 59
## 9 applied materials 58
## 10 these materials 46
## 11 materials Science 43
## 12 Biomedical materials 37
## 13 methods materials 35
## 14 the materials 32
## 15 BIOMEDICAL materials 30
## 16 Amorphous materials 28
## 17 and materials 23
## 18 materials have 22
## 19 method materials 22
## 20 materials science 19
## 21 materials Chemistry 19
## 22 see materials 19
## 23 materials in 18
## 24 Proteineous materials 17
## 25 in materials 16
## 26 materials for 15
## 27 Supplementary materials 15
## 28 advanced materials 15
## 29 materials ( 14
## 30 Hazardous materials 13
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for the lemma material")
head_cooc(stats, lemma="material")
## term1 term2 cooc
## 1 material . 376
## 2 material , 305
## 3 material be 278
## 4 material and 216
## 5 the material 214
## 6 of material 209
## 7 test material 195
## 8 material ( 141
## 9 material in 137
## 10 material at 130
## 11 supplementary material 106
## 12 Supplementary material 94
## 13 these material 89
## 14 this material 74
## 15 bulk material 73
## 16 material for 67
## 17 material that 60
## 18 and material 57
## 19 nanotube material 56
## 20 foreign material 51
## 21 reference material 48
## 22 material with 47
## 23 material have 44
## 24 material : 40
## 25 material on 39
## 26 size material 38
## 27 . material 33
## 28 material [ 32
## 29 material the 32
## 30 material to 32
plot_cooccurrence(stats, lemma="methods", title="Co-occurrences for the lemma methods")
head_cooc(stats, lemma="methods")
## term1 term2 cooc
## 1 and methods 174
## 2 . methods 138
## 3 methods . 71
## 4 in methods 42
## 5 methods for 39
## 6 methods materials 35
## 7 methods Mol 28
## 8 Immunol methods 28
## 9 methods Enzymol 28
## 10 Mech methods 21
## 11 methods ) 20
## 12 methods , 18
## 13 methods Mol. 17
## 14 , methods 17
## 15 methods ( 16
## 16 methods and 13
## 17 methods synthesis 11
## 18 alternative methods 11
## 19 methods Animals 10
## 20 methods section 10
## 21 methods chemicals 9
## 22 analytical methods 9
## 23 methods in 7
## 24 see methods 7
## 25 revise methods 6
## 26 methods Enzymol. 6
## 27 use methods 6
## 28 methods Nanoparticles 5
## 29 methods to 5
## 30 Test methods 5
plot_cooccurrence(stats, lemma="method", title="Co-occurrences for the lemma method")
head_cooc(stats, lemma="method")
## term1 term2 cooc
## 1 and method 506
## 2 method . 491
## 3 the method 449
## 4 method for 448
## 5 method of 315
## 6 . method 293
## 7 method be 279
## 8 method , 269
## 9 method to 225
## 10 method ( 203
## 11 method 2.1 139
## 12 method and 130
## 13 method use 128
## 14 method describe 126
## 15 this method 119
## 16 method : 117
## 17 ) method 99
## 18 a method 94
## 19 test method 81
## 20 method [ 80
## 21 method in 77
## 22 method have 60
## 23 method as 52
## 24 method ) 46
## 25 sensitive method 46
## 26 method with 44
## 27 vitro method 39
## 28 analytical method 37
## 29 method that 35
## 30 : method 35
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 materials and 591
## 2 . materials 513
## 3 and method 506
## 4 method . 491
## 5 the method 449
## 6 method for 448
## 7 material . 376
## 8 method of 315
## 9 material , 305
## 10 . method 293
## 11 method be 279
## 12 material be 278
## 13 method , 269
## 14 method to 225
## 15 material and 216
## 16 the material 214
## 17 of material 209
## 18 method ( 203
## 19 test material 195
## 20 and methods 174
## 21 material ( 141
## 22 method 2.1 139
## 23 . methods 138
## 24 material in 137
## 25 method and 130
## 26 material at 130
## 27 method use 128
## 28 method describe 126
## 29 this method 119
## 30 method : 117
As in the previous approach, we want to explore the relationships of the different lemmas with their neighbourhood in the corpus, but we restrict the analysis to sentences in which the lemma “material” or “materials” is its own head token (head_token_id = 0).
Even if not all “Materials and Methods” section titles have a “materials” lemma with a head_token_id equal to zero, the converse could be true: a “materials” lemma with a head_token_id of zero could reliably indicate such a title.
Here, by restricting to the lemmas “materials” and “material” that have a head_token_id = 0, we can visualise their statistical association with other words and understand whether this subset of tokens really delimits the beginning of the “materials and methods” section.
The first function filters for sentences where the lemma “material” or “materials” is the head. The following lines calculate the co-occurrences and draw the plot as previously.
create_subset_corpus<- function(index){
  # helper used to build a subset of x for the part of the analysis:
  # Co-occurrences for materials and material when their head_token_id = 0
  # x[index,] returns a token and all the associated data: lemma, but also sentence and doc_id
  occurrence <- x[index,] # x is the data frame of annotations generated by udpipe
  sentence_id <- occurrence$sentence_id
  doc_id <- occurrence$doc_id
  # collect the head_token_id and test whether it is equal to zero;
  # if so, return all the tokens of the sentence, otherwise return NULL
  head_token_id <- occurrence$head_token_id
  if (head_token_id==0) {return(strip_corpus(doc_id, sentence_id))}
  return(NULL)
}
strip_corpus <- function(doc_id, sentence_id){
  # return all the rows (tokens and lemmas) of one sentence of one document,
  # so that co-occurrences of words can be computed inside this sentence
  sentence_id <- as.numeric(sentence_id)
  subset_article <- x[which(x$sentence_id==sentence_id & x$doc_id==doc_id),]
  return(subset_article)
}
occurrences<-which(x$lemma=="materials")
subset_corpus<-sapply(occurrences, function(index) create_subset_corpus(index))
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for the lemma materials \n when its head_token_id is equal to 0")
head_cooc(stats, lemma="materials")
## term1 term2 cooc
## 1 materials and 344
## 2 . materials 260
## 3 method materials 78
## 4 materials . 28
## 5 materials materials 19
## 6 methods materials 15
## 7 materials & 9
## 8 materials 2 8
## 9 materials for 7
## 10 Mesoporous materials 4
## 11 applied materials 4
## 12 materials 5 4
## 13 materials of 4
## 14 particulate materials 3
## 15 2 materials 3
## 16 materials , 3
## 17 test materials 3
## 18 advanced materials 3
## 19 materials 6 2
## 20 copolymer materials 2
## 21 materials within 2
## 22 Supplementary materials 2
## 23 Biomedical materials 2
## 24 : materials 2
## 25 nanoparticle materials 2
## 26 of materials 2
## 27 nature materials 2
## 28 Antibacterial materials 2
## 29 materials ( 2
## 30 Amorphous materials 2
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma materials is equal to 0")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 materials and 344
## 2 and method 280
## 3 . materials 260
## 4 method 2.1 114
## 5 method materials 78
## 6 and methods 49
## 7 method 2 36
## 8 materials . 28
## 9 materials materials 19
## 10 methods materials 15
## 11 method Animal 11
## 12 materials & 9
## 13 materials 2 8
## 14 materials for 7
## 15 & method 6
## 16 method : 6
## 17 methods Animals 5
## 18 method preparation 4
## 19 methods chemicals 4
## 20 Mesoporous materials 4
## 21 methods 2.1 4
## 22 applied materials 4
## 23 materials 5 4
## 24 materials of 4
## 25 particulate materials 3
## 26 2 materials 3
## 27 methods 2 3
## 28 materials , 3
## 29 test materials 3
## 30 advanced materials 3
occurrences<-which(x$lemma=="material")
subset_corpus<-sapply(occurrences, function(index) create_subset_corpus(index))
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for the lemma material \n when its head_token_id is equal to 0")
head_cooc(stats, lemma="material")
## term1 term2 cooc
## 1 material and 20
## 2 material . 19
## 3 supplementary material 14
## 4 Supplementary material 14
## 5 . material 13
## 6 material available 12
## 7 test material 10
## 8 material , 8
## 9 section material 7
## 10 material with 6
## 11 Copyrighte material 6
## 12 material Supplementary 4
## 13 method material 4
## 14 material material 4
## 15 material that 4
## 16 material in 4
## 17 reference material 4
## 18 mesoporous material 4
## 19 material from 3
## 20 material for 3
## 21 material 2 3
## 22 composite material 3
## 23 material as 3
## 24 important material 3
## 25 material ( 2
## 26 material on 2
## 27 material Electronic 2
## 28 material experimental 2
## 29 oxide material 2
## 30 result material 2
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma material is equal to 0")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 material and 20
## 2 material . 19
## 3 supplementary material 14
## 4 Supplementary material 14
## 5 and method 14
## 6 . material 13
## 7 material available 12
## 8 test material 10
## 9 material , 8
## 10 section material 7
## 11 method 2.1 6
## 12 material with 6
## 13 Copyrighte material 6
## 14 and methods 5
## 15 material Supplementary 4
## 16 materials and 4
## 17 method material 4
## 18 material material 4
## 19 material that 4
## 20 material in 4
## 21 reference material 4
## 22 mesoporous material 4
## 23 material from 3
## 24 material for 3
## 25 material 2 3
## 26 composite material 3
## 27 material as 3
## 28 important material 3
## 29 material ( 2
## 30 . materials 2
occurrences<-which(x$lemma=="methods")
subset_corpus<-sapply(occurrences, function(index) create_subset_corpus(index))
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="methods", title="Co-occurrences for the lemma methods \n when its head_token_id is equal to 0")
head_cooc(stats, lemma="methods")
## term1 term2 cooc
## 1 . methods 32
## 2 methods Enzymol 14
## 3 methods for 11
## 4 methods . 8
## 5 Mech methods 8
## 6 Immunol methods 7
## 7 methods Mol 6
## 8 revise methods 5
## 9 [ methods 5
## 10 methods ] 5
## 11 assess methods 3
## 12 methods Phys 3
## 13 methods Enzymol. 3
## 14 methods and 3
## 15 methods Mol. 2
## 16 methods in 2
## 17 methods Production 1
## 18 methods ( 1
## 19 methods Enzym. 1
## 20 Enzym. methods 1
## 21 methods 109:55 1
## 22 Enzymol. methods 1
## 23 materials methods 1
## 24 Nat methods 1
## 25 methods 2008;5:763 1
## 26 Standard methods 1
## 27 USA. methods 1
## 28 methods 9 1
## 29 methods 2010;20( 1
## 30 methods General 1
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma methods is equal to 0")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 . methods 32
## 2 methods Enzymol 14
## 3 methods for 11
## 4 methods . 8
## 5 Mech methods 8
## 6 Immunol methods 7
## 7 methods Mol 6
## 8 revise methods 5
## 9 [ methods 5
## 10 methods ] 5
## 11 assess methods 3
## 12 methods Phys 3
## 13 methods Enzymol. 3
## 14 methods and 3
## 15 methods Mol. 2
## 16 methods in 2
## 17 methods Production 1
## 18 methods ( 1
## 19 methods Enzym. 1
## 20 Enzym. methods 1
## 21 methods 109:55 1
## 22 Enzymol. methods 1
## 23 and materials 1
## 24 materials methods 1
## 25 Nat methods 1
## 26 methods 2008;5:763 1
## 27 Standard methods 1
## 28 USA. methods 1
## 29 methods 9 1
## 30 methods 2010;20( 1
occurrences<-which(x$lemma=="method")
subset_corpus<-sapply(occurrences, function(index) create_subset_corpus(index))
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="method", title="Co-occurrences for the lemma method \n when its head_token_id is equal to 0")
head_cooc(stats, lemma="method")
## term1 term2 cooc
## 1 . method 168
## 2 method for 108
## 3 method : 79
## 4 : method 45
## 5 method . 44
## 6 method to 42
## 7 method method 29
## 8 method of 26
## 9 method in 16
## 10 method , 14
## 11 method use 13
## 12 ) method 13
## 13 the method 13
## 14 method and 13
## 15 easy method 10
## 16 sensitive method 10
## 17 method 2.1 9
## 18 test method 8
## 19 a method 8
## 20 method ( 8
## 21 simple method 7
## 22 standard method 6
## 23 method the 6
## 24 Statistical method 6
## 25 vitro method 6
## 26 method that 5
## 27 reliable method 5
## 28 method a 4
## 29 efficient method 4
## 30 new method 4
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma method is equal to 0")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 . method 168
## 2 method for 108
## 3 method : 79
## 4 : method 45
## 5 method . 44
## 6 method to 42
## 7 method method 29
## 8 method of 26
## 9 method in 16
## 10 method , 14
## 11 method use 13
## 12 ) method 13
## 13 the method 13
## 14 method and 13
## 15 easy method 10
## 16 sensitive method 10
## 17 method 2.1 9
## 18 test method 8
## 19 a method 8
## 20 method ( 8
## 21 simple method 7
## 22 standard method 6
## 23 method the 6
## 24 Statistical method 6
## 25 vitro method 6
## 26 method that 5
## 27 reliable method 5
## 28 method a 4
## 29 efficient method 4
## 30 new method 4
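The root-based selection used in this section can also be written without looping over each occurrence. The sketch below is only an illustration on made-up data: the function `sentences_with_root` and the data frame `toy` are hypothetical, and `head_token_id` is assumed to be stored as character.

```r
sentences_with_root <- function(annotation, target) {
  # Occurrences of the target lemma that are the root of their sentence
  is_hit <- annotation$lemma == target & annotation$head_token_id == "0"
  wanted <- unique(paste(annotation$doc_id[is_hit], annotation$sentence_id[is_hit]))
  # Keep every row belonging to one of the selected sentences
  keep <- paste(annotation$doc_id, annotation$sentence_id) %in% wanted
  annotation[keep, ]
}

# Hypothetical toy annotation: "materials" is the root of sentence 1 only
toy <- data.frame(
  doc_id        = rep("doc1", 6),
  sentence_id   = c(1, 1, 1, 2, 2, 2),
  token_id      = c("1", "2", "3", "1", "2", "3"),
  lemma         = c("materials", "and", "method", "these", "materials", "be"),
  head_token_id = c("0", "3", "1", "2", "3", "0"),
  stringsAsFactors = FALSE
)

subset_toy <- sentences_with_root(toy, "materials")
subset_toy$lemma
```

The resulting data frame has the same columns as the input, so its `lemma` column could be passed to the co-occurrence step in the same way as `subset_corpus$lemma`.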
We could assume that the last occurrence of the lemma “materials” in an article corresponds to the section title “materials and methods”. As before, we use co-occurrences to see how words are connected to the last occurrence of “materials” in each document, and how often it corresponds to a “materials and methods” section.
The function below, together with strip_corpus, selects the last occurrence of a lemma in each document and retrieves its sentence. A graph showing the connections between words in this subset of sentences is then plotted.
create_subset_corpus_last_lemmas <- function(index){
  # helper used to build a subset of x for the part of the analysis:
  # Co-occurrences for materials and material when it is the last lemma of the document
  # x[index,] returns a token and all the associated data: lemma, but also sentence and doc_id
  occurrence <- x[index,] # x is the data frame of annotations generated by udpipe
  sentence_id <- occurrence$sentence_id
  doc_id <- occurrence$doc_id
  lemma <- occurrence$lemma
  # rows of x holding the same lemma in the same document; the last one is kept
  occurrences_in_doc <- which(x$doc_id==doc_id & x$lemma==lemma)
  last_occurrence <- occurrences_in_doc[length(occurrences_in_doc)]
  if (last_occurrence==index){return(strip_corpus(doc_id, sentence_id))}
  return(NULL)
}
occurrences<-which(x$lemma=="materials")
subset_corpus<-sapply(occurrences, function(index) create_subset_corpus_last_lemmas(index))
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for the lemma materials \n when it is the last lemma of the document")
head_cooc(stats, lemma="materials")
## term1 term2 cooc
## 1 materials and 306
## 2 . materials 242
## 3 materials . 48
## 4 method materials 37
## 5 of materials 24
## 6 materials , 23
## 7 materials be 22
## 8 methods materials 20
## 9 materials materials 14
## 10 materials & 13
## 11 materials research 12
## 12 and materials 12
## 13 materials for 11
## 14 Amorphous materials 11
## 15 materials in 10
## 16 materials Science 10
## 17 materials science 9
## 18 the materials 9
## 19 Supplementary materials 9
## 20 these materials 8
## 21 nanosized materials 8
## 22 applied materials 7
## 23 BIOMEDICAL materials 7
## 24 materials Inc. 7
## 25 other materials 6
## 26 nanoscale materials 6
## 27 materials to 6
## 28 Biomedical materials 5
## 29 all materials 5
## 30 see materials 5
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when materials is the last lemma of the document")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 materials and 306
## 2 . materials 242
## 3 and method 210
## 4 and methods 75
## 5 method 2.1 56
## 6 materials . 48
## 7 method materials 37
## 8 of materials 24
## 9 materials , 23
## 10 materials be 22
## 11 methods materials 20
## 12 materials materials 14
## 13 materials & 13
## 14 method Animal 12
## 15 materials research 12
## 16 and materials 12
## 17 materials for 11
## 18 method 2 11
## 19 Amorphous materials 11
## 20 materials in 10
## 21 materials Science 10
## 22 method chemicals 10
## 23 materials science 9
## 24 the materials 9
## 25 Supplementary materials 9
## 26 these materials 8
## 27 nanosized materials 8
## 28 methods . 8
## 29 method Characterization 8
## 30 methods Animals 7
occurrences<-which(x$lemma=="material")
subset_corpus<-sapply(occurrences, function(index) create_subset_corpus_last_lemmas(index))
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for the lemma material \n when it is the last lemma of the document")
head_cooc(stats, lemma="material")
## term1 term2 cooc
## 1 of material 99
## 2 material . 92
## 3 material at 86
## 4 material be 53
## 5 material , 53
## 6 material and 48
## 7 Supplementary material 43
## 8 the material 39
## 9 this material 29
## 10 nanotube material 27
## 11 material in 23
## 12 material : 23
## 13 material ( 22
## 14 material available 22
## 15 test material 19
## 16 supplementary material 19
## 17 material for 17
## 18 and material 17
## 19 reference material 10
## 20 / material 10
## 21 material from 9
## 22 in material 9
## 23 material / 9
## 24 genetic material 9
## 25 these material 9
## 26 material characterization 8
## 27 a material 8
## 28 material refer 8
## 29 adequate material 8
## 30 material as 7
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when material is the last lemma of the document")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 of material 99
## 2 material . 92
## 3 material at 86
## 4 material be 53
## 5 material , 53
## 6 material and 48
## 7 Supplementary material 43
## 8 the material 39
## 9 this material 29
## 10 and method 28
## 11 nanotube material 27
## 12 material in 23
## 13 material : 23
## 14 material ( 22
## 15 material available 22
## 16 test material 19
## 17 supplementary material 19
## 18 material for 17
## 19 and material 17
## 20 method , 12
## 21 : method 11
## 22 reference material 10
## 23 / material 10
## 24 material from 9
## 25 in material 9
## 26 material / 9
## 27 genetic material 9
## 28 these material 9
## 29 materials and 8
## 30 material characterization 8
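The selection of the last occurrence per document, used in this section, can be sketched in a few lines of base R. The function `last_occurrence_sentences` and the data frame `toy` are hypothetical, and the sketch assumes that the rows of the annotation are stored in reading order within each document.

```r
last_occurrence_sentences <- function(annotation, target) {
  hits <- which(annotation$lemma == target)
  # For each document, keep only the final hit (rows are assumed to be in reading order)
  last_hits <- hits[!duplicated(annotation$doc_id[hits], fromLast = TRUE)]
  wanted <- paste(annotation$doc_id[last_hits], annotation$sentence_id[last_hits])
  # Return every row of the sentences that contain those last occurrences
  annotation[paste(annotation$doc_id, annotation$sentence_id) %in% wanted, ]
}

# Hypothetical toy annotation: "materials" occurs twice in doc1 and once in doc2
toy <- data.frame(
  doc_id      = c("doc1", "doc1", "doc1", "doc1", "doc2", "doc2"),
  sentence_id = c(1, 1, 2, 2, 1, 1),
  lemma       = c("new", "materials", "materials", "and", "materials", "science"),
  stringsAsFactors = FALSE
)

last_toy <- last_occurrence_sentences(toy, "materials")
last_toy$lemma
```

Only the second sentence of doc1 and the single sentence of doc2 are kept, since they hold the last occurrence of the lemma in their document.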
create_subset_corpus <- function(index, target){
  # Helper for the part of the analysis "Co-occurrences for lemma materials and
  # material when they are the first lemma of a sentence".
  # x is the data frame of annotations generated by udpipe; x[index,] returns one
  # token with its associated data (lemma, sentence_id, doc_id).
  occurrence <- x[index,]
  sentence_id <- occurrence$sentence_id
  doc_id <- occurrence$doc_id
  # Look up the first lemma of that sentence within the same document
  first_lemma <- x[which(x$sentence_id==sentence_id & x$doc_id==doc_id)[1],]$lemma
  # isTRUE() guards against a missing (NA) lemma, which would otherwise stop if()
  if (isTRUE(first_lemma==target)) {return(strip_corpus(doc_id, sentence_id))}
  # Otherwise return NULL so that rbind() skips this occurrence
  return(NULL)
}
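The function above is called once per occurrence of the target lemma. The same selection can be sketched in a vectorised way, without a loop. This is a minimal base R sketch on a toy table, not the method used to produce the results in this report. The names `toy_x` and `sentences_starting_with` are hypothetical, and the toy table only assumes the `doc_id`, `sentence_id` and `lemma` columns of the udpipe annotation.

```r
# Toy annotation table with the udpipe-style columns used above (made-up values)
toy_x <- data.frame(
  doc_id      = c("a", "a", "a", "a", "a", "b", "b", "b"),
  sentence_id = c(1, 1, 1, 2, 2, 1, 1, 1),
  lemma       = c("materials", "and", "method", "the", "materials",
                  "materials", "be", "use"),
  stringsAsFactors = FALSE
)

sentences_starting_with <- function(x, target) {
  # One key per (document, sentence) pair
  key <- paste(x$doc_id, x$sentence_id, sep = "||")
  # TRUE on the first row of each sentence, in table order
  first_rows <- !duplicated(key)
  # Sentences whose first lemma is the target (missing lemmas are ignored)
  keep_keys <- key[first_rows & !is.na(x$lemma) & x$lemma == target]
  # All tokens of the selected sentences, each sentence kept once
  x[key %in% keep_keys, , drop = FALSE]
}

toy_subset <- sentences_starting_with(toy_x, "materials")
print(toy_subset)
```

One difference is worth noting. Assuming `strip_corpus()` returns all the tokens of the requested sentence, the loop version appends a sentence once per occurrence of the target lemma in it, whereas this sketch keeps each sentence once. Co-occurrence counts computed from the two subsets may therefore differ.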
occurrences<-which(x$lemma=="materials")
#lapply always returns a list (sapply may simplify its result), which is what do.call(rbind, ...) expects
subset_corpus<-lapply(occurrences, create_subset_corpus, target="materials")
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for lemma materials when it is the first lemma of a sentence")
head_cooc(stats, lemma="materials")
## term1 term2 cooc
## 1 materials and 393
## 2 . materials 264
## 3 method materials 124
## 4 methods materials 50
## 5 materials materials 30
## 6 materials . 23
## 7 materials Science 10
## 8 : materials 10
## 9 materials & 9
## 10 materials be 7
## 11 Amorphous materials 6
## 12 materials science 5
## 13 material materials 5
## 14 materials Chitosan 5
## 15 Animals materials 5
## 16 nanoparticle materials 5
## 17 materials 5 5
## 18 materials , 5
## 19 materials for 4
## 20 chemicals materials 4
## 21 materials PTX 4
## 22 materials Pristine 4
## 23 663 materials 4
## 24 Nanoparticles materials 4
## 25 of materials 4
## 26 Characterization materials 3
## 27 , materials 3
## 28 ) materials 3
## 29 materials ) 3
## 30 animal materials 2
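Pairs such as ". materials" may look surprising in a subset where "materials" opens every sentence. A likely explanation, stated here as an assumption, is that the selected sentences are stacked into a single lemma vector and that `cooccurrence()` with `skipgram = 0` counts directly adjacent lemmas without knowing where sentences end. The base R sketch below approximates this adjacent-pair counting on toy data. The function name `adjacent_pairs` and the toy lemmas are hypothetical.

```r
# Adjacent-pair counts over a lemma vector: a base R approximation of
# cooccurrence(x, skipgram = 0), assuming term1 directly precedes term2.
# Expects a character vector with at least two elements.
adjacent_pairs <- function(lemmas) {
  n <- length(lemmas)
  pairs <- data.frame(term1 = lemmas[-n], term2 = lemmas[-1], cooc = 1,
                      stringsAsFactors = FALSE)
  # Sum the occurrences of each ordered pair
  counts <- aggregate(cooc ~ term1 + term2, data = pairs, FUN = sum)
  counts[order(counts$cooc, decreasing = TRUE), ]
}

# Two toy sentences, both starting with "materials", stacked as rbind() would stack them
toy_lemmas <- c("materials", "and", "method", ".",
                "materials", "be", "purchase", ".")
toy_counts <- adjacent_pairs(toy_lemmas)
print(toy_counts)
```

In the toy result the pair ". materials" comes only from the boundary between the two stacked sentences, which is one way the rows with "materials" as the second term can arise in the table above.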
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when lemma materials is the first lemma of a sentence")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 materials and 393
## 2 . materials 264
## 3 and method 261
## 4 method materials 124
## 5 and methods 115
## 6 methods materials 50
## 7 materials materials 30
## 8 materials . 23
## 9 method Animal 21
## 10 method 2.1 13
## 11 method preparation 11
## 12 method chemicals 11
## 13 method material 10
## 14 materials Science 10
## 15 : materials 10
## 16 materials & 9
## 17 method Characterization 9
## 18 methods Animals 8
## 19 method : 8
## 20 materials be 7
## 21 method synthesis 6
## 22 & method 6
## 23 Amorphous materials 6
## 24 materials science 5
## 25 material materials 5
## 26 methods chemicals 5
## 27 materials Chitosan 5
## 28 Animals materials 5
## 29 nanoparticle materials 5
## 30 method Reagent 5
occurrences<-which(x$lemma=="material")
#lapply always returns a list (sapply may simplify its result), which is what do.call(rbind, ...) expects
subset_corpus<-lapply(occurrences, create_subset_corpus, target="material")
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for lemma material when it is the first lemma of a sentence")
head_cooc(stats, lemma="material")
## term1 term2 cooc
## 1 material and 20
## 2 . material 17
## 3 methods material 5
## 4 material on 4
## 5 method material 3
## 6 test material 2
## 7 material once 2
## 8 material treatment 2
## 9 material -induced 2
## 10 material & 2
## 11 Test material 2
## 12 Characterization material 2
## 13 Organisms material 1
## 14 material supply 1
## 15 characterization material 1
## 16 material characterization 1
## 17 validation material 1
## 18 material engineering 1
## 19 Reagent material 1
## 20 study material 1
## 21 material composition 1
## 22 altered material 1
## 23 material investigation 1
## 24 be material 1
## 25 component material 1
## 26 material material 1
## 27 material Implanting 1
## 28 Implanting material 1
## 29 material deposition 1
## 30 material , 1
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when lemma material is the first lemma of a sentence")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 material and 20
## 2 . material 17
## 3 and methods 11
## 4 and method 6
## 5 methods material 5
## 6 material on 4
## 7 method material 3
## 8 test material 2
## 9 material once 2
## 10 method Animal 2
## 11 material treatment 2
## 12 material -induced 2
## 13 material & 2
## 14 & method 2
## 15 methods Test 2
## 16 Test material 2
## 17 Characterization material 2
## 18 methods . 1
## 19 methods Silica 1
## 20 Organisms material 1
## 21 material supply 1
## 22 characterization material 1
## 23 material characterization 1
## 24 validation material 1
## 25 material engineering 1
## 26 methods Cytotoxicity 1
## 27 Reagent material 1
## 28 study material 1
## 29 methods preparation 1
## 30 a method 1